1 Exercise 1

1.1 Scrutinize the data to assess structure and quality. Are there any improbable or problematic entries? Provide a summary of checks performed and edit the data so entries are valid and meaningful where editing is reasonable to do.

The data set is quite well structured and its overall quality is good. However, there are a few problematic entries. The following steps remove these entries and modify the data set so that it can be used easily for further analysis.

  1. Though the data is meant to be collected from the USA, there were a few entries where the company's location was given as United Kingdom. These entries, listed below, were removed from the data set.
| Job Title | Location |
|---|---|
| Genomic Data Scientist | Stevenage, United Kingdom |
| Scientist, Data, Methods and Analytics Immuno-inflammation and Specialty Medicines | Stevenage, United Kingdom |
| Scientist in Data, Methods, & Analytics | Brentford, United Kingdom |
| Lead Data Analyst | Brentford, United Kingdom |

  2. All company names had the rating appended to them, even though the rating was already provided in a separate column. The ratings were therefore removed from the company names.

  3. The Size (number of employees) column was of type character, so it was converted to a factor with its levels set accordingly.

  4. All values in the Revenue column were given in USD, so "USD" was removed from the values and added to the column name.

  5. The salary estimate column was very messy: it contained many different ranges, some of which overlapped. It also mixed different estimate types (Glassdoor estimate, employer estimate and per hour estimate). The type was separated from the estimate value to make the data easier to use, and the ranges were reconstructed to minimise the number of distinct ranges.

  6. All -1 values were converted to NAs.
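
Some of the cleaning steps above can be sketched in code. The report's own code is not shown, so the following is only a minimal Python sketch using the standard library, not the method actually used. The helper names (`clean_company_name`, `minus_one_to_none`, `parse_salary`) are hypothetical, and the raw formats are assumptions: a rating appended to the company name after a line break (for example `"Acme Corp\n3.2"`) and salary strings such as `"$53K-$91K (Glassdoor est.)"` or `"$10-$26 Per Hour(Glassdoor est.)"`.

```python
import re

# Assumed raw salary format: "$53K-$91K (Glassdoor est.)" or
# "$10-$26 Per Hour(Glassdoor est.)". This is an assumption, not taken from the report.
SALARY_RE = re.compile(
    r"\$(?P<low>\d+)(?P<k1>K?)\s*-\s*\$(?P<high>\d+)(?P<k2>K?)"
    r"\s*(?P<per_hour>Per Hour)?\s*\((?P<source>[^)]*)\)"
)


def clean_company_name(name):
    """Drop a rating appended after a line break, e.g. 'Acme Corp\\n3.2'."""
    return re.sub(r"\s*\n\s*\d(\.\d)?\s*$", "", name).strip()


def minus_one_to_none(value):
    """Treat the placeholder -1 (stored as a number or as text) as missing."""
    return None if str(value).strip() in {"-1", "-1.0"} else value


def _scale(k_suffix):
    """Return 1000 when the amount carried a 'K' suffix, otherwise 1."""
    return 1000 if k_suffix else 1


def parse_salary(text):
    """Split a salary estimate into minimum, maximum, per-hour flag and source."""
    match = SALARY_RE.search(str(text))
    if match is None:
        return None  # e.g. the placeholder "-1"
    return {
        "min": int(match["low"]) * _scale(match["k1"]),
        "max": int(match["high"]) * _scale(match["k2"]),
        "per_hour": match["per_hour"] is not None,
        "source": match["source"].strip(),
    }


print(clean_company_name("Acme Corp\n3.2"))
print(parse_salary("$53K-$91K (Glassdoor est.)"))
```

Keeping the estimate source and the per-hour flag in separate fields, as in step 5, means the numeric minimum and maximum can be compared directly in later analysis.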

1.2 How many job listings provide salary (intervals) on a per hour basis?

There are 21 job listings that provide salary (intervals) on a per hour basis.
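
A count like this can be obtained by flagging estimates that mention an hourly rate. The sketch below is in Python and assumes the raw text contains the phrase "Per Hour"; the function name `count_per_hour` is hypothetical and the sample values are invented for illustration, so it does not reproduce the figure of 21.

```python
def count_per_hour(estimates):
    """Count salary estimates whose text mentions an hourly rate."""
    return sum("per hour" in str(estimate).lower() for estimate in estimates)


# Invented sample values, not taken from the data set.
sample_estimates = [
    "$53K-$91K (Glassdoor est.)",
    "$10-$26 Per Hour(Glassdoor est.)",
    "$70K-$120K (Employer est.)",
    "-1",
]

print(count_per_hour(sample_estimates))
```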

1.3 We want to investigate what the differences are between the job listings for those under different classification, i.e. business analytics, data analytics and data science. Compare across the classifications using appropriate graphics the:

1.3.1 salary intervals (study the minimum and maximum of the intervals)

Figure 1.1: Maximum and Minimum Salary comparison

Data Scientists have both the highest maximum salary limit and the lowest minimum salary limit, which shows how diverse the Data Scientist job classification can be.
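
The comparison rests on the lowest interval minimum and the highest interval maximum within each classification. As a minimal Python sketch, assuming each listing has already been reduced to a (classification, minimum, maximum) triple, these extremes could be computed as follows. The function name `salary_extremes` and the sample rows are invented for illustration and are not values from the data set.

```python
def salary_extremes(rows):
    """Return {classification: (lowest minimum, highest maximum)}."""
    extremes = {}
    for job_class, low, high in rows:
        if job_class not in extremes:
            extremes[job_class] = (low, high)
        else:
            current_low, current_high = extremes[job_class]
            extremes[job_class] = (min(current_low, low), max(current_high, high))
    return extremes


# Invented sample rows: (classification, interval minimum, interval maximum).
sample_rows = [
    ("Data Scientist", 12000, 56000),
    ("Data Scientist", 200000, 254000),
    ("Data Analyst", 24000, 38000),
    ("Data Analyst", 113000, 190000),
    ("Business Analyst", 27000, 52000),
]

for job_class, (lowest, highest) in salary_extremes(sample_rows).items():
    print(job_class, lowest, highest)
```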

1.3.2 location of the job (study by State)

Figure 1.2: Location of Business Analyst job by state

In the USA, Business Analyst jobs are most common in Texas and California. The count is markedly lower in New York, which is an interesting observation.

Figure 1.3: Location of Data Analyst job by state

There are considerably fewer Data Analyst jobs than Business Analyst jobs. Data Analyst jobs are most common in Texas, California and New York.

Figure 1.4: Location of Data Scientist job by state

There are more Data Scientist jobs than Business Analyst or Data Analyst jobs. This was also evident from the bar graph above.
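
Counts by state require a state to be extracted from each location first. A minimal Python sketch, assuming locations are written as "City, ST" (the helper names `state_of` and `listings_by_state` are hypothetical, and the sample locations are invented):

```python
from collections import Counter


def state_of(location):
    """Return the part after the last comma, or None if there is no comma."""
    parts = [part.strip() for part in str(location).split(",")]
    return parts[-1] if len(parts) > 1 else None


def listings_by_state(locations):
    """Count listings per state, skipping locations without a state part."""
    states = (state_of(location) for location in locations)
    return Counter(state for state in states if state)


# Invented sample locations.
sample_locations = ["Austin, TX", "Dallas, TX", "San Jose, CA", "Remote"]

print(listings_by_state(sample_locations).most_common())
```

Under the same assumption, a location such as "Stevenage, United Kingdom" would yield "United Kingdom" rather than a two-letter code, which is one way the non-US entries could be spotted.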

1.3.3 company size

Figure 1.5: Ratio of different company sizes for Business Analysts

Figure 1.6: Ratio of different company sizes for Data Analysts

Figure 1.7: Ratio of different company sizes for Data Scientists

Startups (companies with fewer employees) are more common in the Business Analyst field than in the other two, while Data Scientists have more opportunities in larger companies.

1.3.4 Industry

Figure 1.8: Business Analyst in various Industries

Figure 1.9: Data Analyst in various Industries

Figure 1.10: Data Scientist in various Industries

Staff Outsourcing and IT Services are the industries in which all three job classifications are most prevalent.

1.3.5 Sector

Figure 1.11: Data Scientist in various Sectors

Figure 1.12: Data Analyst in various Sectors

Figure 1.13: Business Analyst in various Sectors

Information Technology and Business Services are the predominant sectors in which these job classifications are required.

1.4 Your friend suspects that if an employer provides a salary range for the job, the salary is large and hence more attractive to potential candidates. Investigate this claim. Your investigation should be supported by graphics.

Figure 1.14: Maximum Salary vs Rating

Based on the graph above, this claim seems to be true: job ratings tend to increase as the maximum salary increases.
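
A complementary check, not part of the analysis above, would be to compare salaries by estimate source directly. The Python sketch below assumes each listing has been reduced to a (source, maximum salary) pair, with the source taken from the salary estimate text. The function name `median_max_by_source` and the sample rows are invented for illustration.

```python
from collections import defaultdict
from statistics import median


def median_max_by_source(rows):
    """Return the median maximum salary for each estimate source."""
    by_source = defaultdict(list)
    for source, max_salary in rows:
        by_source[source].append(max_salary)
    return {source: median(values) for source, values in by_source.items()}


# Invented sample rows: (estimate source, maximum salary).
sample_rows = [
    ("Employer est.", 120000),
    ("Employer est.", 100000),
    ("Employer est.", 140000),
    ("Glassdoor est.", 90000),
    ("Glassdoor est.", 110000),
]

print(median_max_by_source(sample_rows))
```

The median is used here rather than the mean so that a few very high maxima do not dominate the comparison.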

1.5 Is the location (by State) associated with the salary and/or sector? Show graphics to best support your conclusion.

Figure 1.15: Salary vs State

Salary ranges in California, Texas and New York are comparatively higher than in the other states.

Figure 1.16: Sector vs State

The sector count is higher in Texas and California than in the other states. This may also be because these two states have more listings overall.
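
One way to separate a genuine association from the effect of listing volume is to look at each sector's share within a state rather than its raw count. The following is a minimal Python sketch, assuming each listing is a (state, sector) pair. The function name `sector_share_by_state` and the sample rows are invented for illustration.

```python
from collections import Counter, defaultdict


def sector_share_by_state(rows):
    """Return {state: {sector: share of that state's listings}}."""
    counts = defaultdict(Counter)
    for state, sector in rows:
        counts[state][sector] += 1
    shares = {}
    for state, sector_counts in counts.items():
        total = sum(sector_counts.values())
        shares[state] = {
            sector: count / total for sector, count in sector_counts.items()
        }
    return shares


# Invented sample rows: (state, sector).
sample_rows = [
    ("TX", "Information Technology"),
    ("TX", "Information Technology"),
    ("TX", "Business Services"),
    ("TX", "Finance"),
    ("NY", "Information Technology"),
    ("NY", "Finance"),
]

for state, shares in sector_share_by_state(sample_rows).items():
    print(state, shares)
```

If the shares look similar across states, the higher raw counts in Texas and California would mostly reflect the larger number of listings there.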